On the Pictorial Structure of Chinese Characters

by

B. Kirk Rankin III
Walter A. Sillars
Robert W. Hsu

A grammar of radical combination in Chinese
characters has been written. From a sample study,
it appears that this grammar is powerful enough to
generate 80% of the characters in one of the standard
dictionaries. Research leading to the construction of
this grammar is embedded in a larger framework for
describing the pictorial structure of the characters in
detail. The descriptive framework defines five areas
of study within the overall study of the characters:
radical combination, radical variation, stroke
combination, stroke variation, and distinctive features of
strokes.

A.  Extended  Notions   of  Grammar  and  Language 

Modern linguistics is concerned with the systematic study of
(natural) languages. It typically involves the discovery and analysis of
regularities of various kinds, and the expression of these (and
of their interrelationships) in the form of grammars for those
languages. Some types of regularities studied by linguistics are the
relations among sentences; the constituent structure of sentences in
terms of words; word-formation; the constituent structure of words in
terms of sound units; and the specification of sound units in terms of
sound features. 1/

1/ Those interested in detailed accounts of modern linguistics might
consult [1] and [2].

It is our contention that sets of data other than natural languages
show many of the same features and regularities as those that are
found in natural languages. Some very preliminary work on the
analysis of chemical structure diagrams and electronic circuit diagrams
has indicated that at least the linguistic notion of constituent structure
is quite valuable in describing rather complex diagrams of both types.
One other set of data which shares with these two types of diagrams
both their "two dimensional" nature and their tractability to linguistic
description is the set of Chinese characters. The linguistic analysis
of this set is our concern in this report. 1/

To  summarize:     We  view  the  set  of  characters  as  a  language. 
This  language  shows  many  regular  features  similar  to  features  found 
in  natural  languages.      Accustomed  to  constructing  grammars  to  exploit 
and  explain  these  regularities  in  natural  languages,    we  construct  a 
grammar  to  account  for  corresponding  phenomena  in  the  language  of 
Chinese  characters. 

1/ Let it be understood that the set of characters we describe is
more or less abstract. What we describe is the manual output
of a hypothetical native writer of Chinese as he prints characters.
We do not describe either cursive handwriting or actual printed
specimens from some font.

B. The Descriptive Framework

What follows is a general outline of a framework for studying
Chinese characters from a linguistic point of view. The framework or
model is inferred from two sources: Hockett [3] and Lamb [4]. We
are not committed to this particular framework of description. 1/ It
simply offers us what we want: a convenient framework for presenting
our research results; in the long run, it may turn out to be inadequate
or in some sense undesirable. The model imposes the following
stairway structure on our language of Chinese characters:

CHARACTER -- C -- RADICAL
                     |
                     R
                     |
                  RADICAL -- C -- STROKE
                  VARIANT           |
                                    R
                                    |
                                 STROKE  -- C -- DISTINCTIVE
                                 VARIANT         COMPONENT

C = "is composed of"
R = "is represented by"

1/ And we specifically do not pretend to speak officially for the
"Stratificational" school of linguistics, which appears to take
as basic to its theory the two sources just cited.

C. Characters are Composed of Radicals

Ours is not, by any means, the only way of describing the
structure of Chinese characters in terms of radical-combination. The
system in most common use, by both Chinese scholars and Western
sinologists, is that of classification according to radical and phonetic. 2/
Traditional Chinese scholarship recognizes 214 distinguished
characters called radicals, some of which have variant forms. (See the list
displayed in Appendix A.) Each character is either itself a radical or
may be decomposed into a radical plus a residue which is generally
called the phonetic.

2/ The traditional approach described here is essentially a means of
getting into a dictionary. It does not pretend to be a detailed study
of character structure such as we are offering.

The ideal to which the traditional system aspires is quite simple.
When a character is decomposed into radical and phonetic,
the radical gives a clue to the meaning of the character and the
phonetic a clue to its pronunciation. The traditional algorithm which
identifies the radical in the majority of cases is as follows:

If  the  character  naturally  splits  into  a  left  side  and  right  side, 
the  left  side  is  the  radical;  otherwise,    if  the  character 
naturally  splits  into  a  top  and  a  bottom,    then  the  bottom  is 
the  radical. 

However, there are many characters for which this procedure does
not give the correct radical. These characters may be divided into
two classes:

1. Those to which the algorithm does not apply, since the
character does not have either a natural left-right or
top-bottom split, for example, [illegible], whose radical is
[illegible], and

2. Those for which the algorithm determines a radical other
than the correct one, for example, [illegible], whose radical is
[illegible].

A more elaborate algorithm for identifying the radical is described
in Chao [5], Mathews [6], and many other sources, including nearly
every Chinese-English dictionary. It runs somewhat as follows:

1. If the character is itself a radical, then it is its own radical.

2. If the character has a natural left-right split and

   a. If the right side is one of the small number of radicals
      which can occur on the right, then the right side is the
      radical; otherwise,

   b. If the left side is a radical, then the left side is the
      radical; if not, use the "look-up" procedure in step 5.

3. If neither 1 nor 2 applies, and if the character has a natural
   top-bottom split and

   a. If the top is one of the small number of radicals which
      can occur on the top, then the top is the radical;
      otherwise,

   b. If the bottom is a radical, then the bottom is the radical;
      if not, use the "look-up" procedure in step 5.

4. If none of the above apply, use the procedure for determining
   the radical of a character whose natural split is into a
   border and an inside, or which is formed by the
   superimposition of two or more parts. If this procedure determines the
   radical, the job is finished; if not,

5. Consult a table of characters whose radicals are difficult to
   determine by inspection.
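
The five steps above can be sketched in code. The following is only an illustrative sketch: the radical sets, the look-up table, and the representation of a character as a name plus a split type and two parts are all hypothetical placeholders, and step 4 is left as a stub because the text gives no explicit procedure for it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholder data; a real system would use the 214 radicals
# and a dictionary's table of difficult characters.
RADICALS = {"r_left", "r_right", "r_top", "r_bottom", "r_self"}
RIGHT_RADICALS = {"r_right"}      # radicals that may occur on the right
TOP_RADICALS = {"r_top"}          # radicals that may occur on the top
LOOKUP_TABLE = {"hard_char": "r_self"}   # step 5

@dataclass(frozen=True)
class Char:
    name: str
    split: Optional[str] = None   # "lr" (left-right), "tb" (top-bottom), or None
    first: str = ""               # left part or top part
    second: str = ""              # right part or bottom part

def border_or_superimposed(c: Char) -> Optional[str]:
    """Step 4: not made explicit in the text, so this stub always gives up."""
    return None

def find_radical(c: Char) -> Optional[str]:
    if c.name in RADICALS:                    # step 1
        return c.name
    if c.split == "lr":                       # step 2
        if c.second in RIGHT_RADICALS:        # 2a
            return c.second
        if c.first in RADICALS:               # 2b
            return c.first
        return LOOKUP_TABLE.get(c.name)       # step 5
    if c.split == "tb":                       # step 3
        if c.first in TOP_RADICALS:           # 3a
            return c.first
        if c.second in RADICALS:              # 3b
            return c.second
        return LOOKUP_TABLE.get(c.name)       # step 5
    found = border_or_superimposed(c)         # step 4
    if found is not None:
        return found
    return LOOKUP_TABLE.get(c.name)           # step 5
```

Note how much of the work falls through to the table in step 5 whenever the split is unusual; this is the dependence on look-up discussed next.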

There seem to be at least two obvious weak points in this algorithm,
namely:

1. The procedure to be followed in step 4 is not explicit; for
example, the radical of [illegible] is [illegible], while that of
[illegible] is [illegible].

When we consider in addition to these the fact that the list of difficult
characters 1/ in Mathews includes over 10% of the total number of
characters in this dictionary, it is easy to see why the traditional
analysis of characters is cumbersome to use when searching a
dictionary for an unfamiliar character.

We feel that the procedure of assigning to each character an
associated character chosen from a list of distinguished characters
is a useful tool for classifying the characters; however, we also feel
that the defects in the traditional classification scheme make that
particular system infeasible for mechanical recognition purposes and
not particularly desirable for other purposes. Therefore, we give in
Appendix B a list which, when finished, is intended to replace in our
work the traditional list of 214 radicals. This list has been arrived
at by observing combinations of strokes which recur often enough to
prompt us to recognize them as "linguistic" units.

We have written a grammar similar to Chomsky's phrase-structure
grammars 2/ which seems to be capable of generating
approximately 80% of the characters in Mathews [6]. This grammar is
exhibited in Appendix B. The manner in which the grammar operates,
along with some of the difficulties we have encountered, is illustrated
by a small-scale grammar DZ4, given below.

1/ That is, those characters whose radicals are not obvious.

2/ See Chapter 4 in [7].



The grammar DZ4 consists of a finite number of rewrite rules.
These rules are grouped into two classes, syntactic and lexical. The
syntactic rules have the form A = X or A = a (X, Y) where A, X and Y
are nodes of the grammar and a is one of the three symbols h, v, or s.
The lexical rules have the form A = R where A is a node and R is a
radical. For our purposes a node is any symbol which occurs as a
constituent on the left side of some rule (either syntactic or lexical),
and a radical is any symbol which occurs on the right side of a lexical
rule. There are no symbols which are both nodes and radicals. The
syntactic rules correspond to those rules in a phrase-structure
grammar which generate from the top node down to the immediate
subterminal level; the lexical rules correspond to those which generate
the terminal symbols.

We use the customary left-to-right expansion convention, except
that all syntactic rules are applied before any lexical rules. The rules
of the grammar DZ4 are as follows:

Syntactic Rules

CHAR = COMP; IT 1/
COMP = v (N, S); h (W, E); s (O, CHAR)
N    = COMP; NT
S    = COMP; ST
W    = COMP; WT
E    = COMP; ET
O    = OT

1/ The semicolon separates optional subrules; this line is really an
abbreviation for the two rules: CHAR = COMP and CHAR = IT.


Lexical Rules

WT   ET   NT   ST   OT   IT

[The radicals listed under each column heading are illegible in this copy.
Two of them carry the following footnote.]

1/ Of course these two radicals may, in the long run, prove to be
variants of a single radical. However, that fact need not prevent
us from using them in a sample grammar as two different
radicals.



CHAR (= Character) is the initial or top node of the grammar.
CHAR can be rewritten as COMP, which eventually gives a complex
(i.e., multi-radical) character. It can also be rewritten as IT, a
symbol which heads a column in the lexical rules, the column
consisting of terminal symbols which can occur in isolation. COMP can be
rewritten as a vertical array having top N (North) and bottom S (South),
a horizontal array having left W (West) and right E (East), or a
surrounding array having border O (Outside) and inside CHAR (any
character). N can be rewritten either as COMP (note that this option
introduces a potentially infinite loop into the grammar) or as NT, the
heading of a column of terminal symbols in the lexical rules which
can occur on the top of a vertical array. The situation is similar for
the remaining rules except that the border of a surrounding display
cannot consist of a complex character.
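
As an illustration of how these rules operate, the syntactic rules of DZ4 can be written down as a table and expanded mechanically. The sketch below is not the authors' program; the function name `expand` and the `depth` budget (a cap on nested COMP expansions, introduced only so that the enumeration terminates) are assumptions made for this example. It enumerates structures down to the column headings WT, ET, NT, ST, OT, IT and stops before lexical insertion.

```python
from itertools import product

# Syntactic rules of DZ4 as given in the text. An alternative is either a
# single symbol or a tuple (operator, first node, second node).
SYNTACTIC = {
    "CHAR": ["COMP", "IT"],
    "COMP": [("v", "N", "S"), ("h", "W", "E"), ("s", "O", "CHAR")],
    "N": ["COMP", "NT"],
    "S": ["COMP", "ST"],
    "W": ["COMP", "WT"],
    "E": ["COMP", "ET"],
    "O": ["OT"],
}
# Column headings of the lexical rules; expansion stops at these.
PRETERMINALS = {"IT", "NT", "ST", "WT", "ET", "OT"}

def expand(symbol, depth):
    """Yield every structure derivable from `symbol` in which operators
    are nested at most `depth` deep (depth is a hypothetical budget)."""
    if symbol in PRETERMINALS:
        yield symbol
        return
    for alt in SYNTACTIC[symbol]:
        if isinstance(alt, str):
            yield from expand(alt, depth)
        elif depth > 0:
            op, a, b = alt
            lefts = list(expand(a, depth - 1))
            rights = list(expand(b, depth - 1))
            for left, right in product(lefts, rights):
                yield f"{op}({left}, {right})"

if __name__ == "__main__":
    for structure in expand("CHAR", 1):
        print(structure)
```

With a budget of one operator the sketch yields the three two-part arrays and the isolated character; raising the budget lets COMP re-enter through N, S, W, E, and CHAR, which is the loop noted above.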

Some examples of characters which DZ4 can generate are
[five characters, illegible in this copy]. Although not all of these are in fact
in use, they are all well-formed and acceptable to informants.


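The steps in the generation of [illegible] are:

Syntactic Steps

CHAR
COMP
h (W, E)
h (WT, E)
h (WT, COMP)
h (WT, s (O, CHAR))
h (WT, s (OT, CHAR))
h (WT, s (OT, COMP))
h (WT, s (OT, v (N, S)))
h (WT, s (OT, v (NT, S)))
h (WT, s (OT, v (NT, ST)))

Lexical Steps

h ([w], s (OT, v (NT, ST)))
h ([w], s ([o], v (NT, ST)))
h ([w], s ([o], v ([n], ST)))
h ([w], s ([o], v ([n], [s])))

Here [w], [o], [n], and [s] stand for the radicals selected from the
WT, OT, NT, and ST columns; the glyphs themselves are illegible in
this copy.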
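
The syntactic steps of this derivation can be replayed mechanically. The sketch below assumes that the outermost operator of the example is h (as the rule COMP = h (W, E) implies) and encodes each step as the index of the subrule chosen for the leftmost unexpanded node; the function name `derive` and the choice-list encoding are assumptions made for illustration.

```python
import re

# DZ4 syntactic rules, with each subrule written as it appears in a
# sentential form.
RULES = {
    "CHAR": ["COMP", "IT"],
    "COMP": ["v (N, S)", "h (W, E)", "s (O, CHAR)"],
    "N": ["COMP", "NT"],
    "S": ["COMP", "ST"],
    "W": ["COMP", "WT"],
    "E": ["COMP", "ET"],
    "O": ["OT"],
}
# Matches a node as a whole word, so WT, ET, NT, ST, OT, IT are skipped
# and the lowercase operators h, v, s are never mistaken for nodes.
NODE = re.compile(r"\b(CHAR|COMP|N|S|W|E|O)\b")

def derive(choices):
    """Apply the chosen subrule to the leftmost node at every step and
    return the list of sentential forms, starting from CHAR."""
    form = "CHAR"
    steps = [form]
    for choice in choices:
        match = NODE.search(form)
        if match is None:
            raise ValueError("no node left to expand")
        replacement = RULES[match.group(1)][choice]
        form = form[:match.start()] + replacement + form[match.end():]
        steps.append(form)
    return steps

if __name__ == "__main__":
    for step in derive([0, 1, 1, 0, 2, 0, 0, 0, 1, 1]):
        print(step)
```

The ten choices produce eleven forms, ending in h (WT, s (OT, v (NT, ST))), at which point only column headings remain and the lexical rules take over.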

We will now discuss some currently unresolved problems
concerning our grammar, specifically as they relate to the grammar
displayed in Appendix B, although most of our examples are chosen from
DZ4.


First is the question of recursion. Recall that in DZ4 there is a
potentially infinite loop through the node COMP. This has the effect of
asserting that there is no upper bound on the size of the characters, so
that [illegible] is just as well-formed as [illegible], [illegible], etc.
We are not satisfied with this feature of the grammar, but neither are
we pleased with the thought of putting an arbitrary upper bound on the
number of recursions permitted.
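
The effect of the loop through COMP can be made concrete by counting. The sketch below is an illustration under stated assumptions: it counts structures over the column headings only (before any radical is inserted), and `depth` is a hypothetical cap on how deeply operators may be nested. The recurrence follows directly from the DZ4 syntactic rules.

```python
def count_structures(depth: int) -> int:
    """Number of distinct structures derivable from CHAR in DZ4 when
    operators may be nested at most `depth` deep."""
    comp = 0                              # COMP cannot be expanded with no budget
    for _ in range(depth):
        side = comp + 1                   # N, S, W, E: a COMP structure or the column heading
        inner = comp + 1                  # CHAR inside a surround: a COMP structure or IT
        comp = 2 * side * side + inner    # v (N, S) + h (W, E) + s (O, CHAR)
    return comp + 1                       # CHAR = COMP; IT

if __name__ == "__main__":
    for d in range(4):
        print(d, count_structures(d))
```

The counts for depths 0 through 3 are 1, 4, 37, and 2776, so each additional level roughly squares the number of structures; any fixed bound on recursion would therefore cut off a very large space at an arbitrary point.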

The second problem is actually an extreme case of the first.
There is nothing in the grammar to prevent an arbitrarily long
recursion process which generates exclusively horizontal or exclusively
vertical strings. That is, the grammar contains sub-grammars like

[display illegible in this copy]

which give rise to terminal arrays like [illegible], which do
not satisfy our intuition as to what ought to constitute a well-formed
character. Our tentative conclusion is that the grammar should have
a device associated with it that would force an alternation between
horizontal and vertical rules. For those relatively few characters
which allow more than one successive application of a horizontal (or
vertical) rule, a list would have to be constructed.
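
One possible form of such a device is a filter applied to finished structures. The sketch below is only one reading of the proposal: structures are nested tuples (operator, first, second), a horizontal rule may not apply directly inside a horizontal rule (likewise for vertical), and a surround in between is assumed to reset the alternation. The names `alternates`, `acceptable`, and `EXCEPTIONS` are hypothetical.

```python
# Structures are either a column heading (a string) or a tuple
# (operator, first part, second part) with operator "h", "v", or "s".

# Hypothetical list of structures that are allowed to repeat a horizontal
# or vertical rule; the text says such a list would have to be constructed.
EXCEPTIONS = {
    ("h", "WT", ("h", "WT", "ET")),
}

def alternates(tree, parent_op=None):
    """True if no h is applied directly inside an h and no v directly
    inside a v (assumption: an intervening s resets the alternation)."""
    if isinstance(tree, str):
        return True
    op, first, second = tree
    if op in ("h", "v") and op == parent_op:
        return False
    return alternates(first, op) and alternates(second, op)

def acceptable(tree):
    """Accept a structure if it alternates or is a listed exception."""
    return tree in EXCEPTIONS or alternates(tree)
```

For example, `("h", "WT", ("v", "NT", "ST"))` passes, while a purely horizontal chain such as `("h", "WT", ("h", "WT", ("h", "WT", "ET")))` is rejected unless it is explicitly listed.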

Third, the grammars (both the one in Appendix B and DZ4)
generate some characters in more than one way. In the appended
grammar, for example, [illegible] is generated both as a combination of
[illegible] and [illegible] and as a combination of [illegible] and
[illegible].

There is strong motivation for multiple generations in a grammar
for generating sentences of a natural language. In such cases we say
that the given sentence is syntactically ambiguous. There will, in
general, be one semantic interpretation for each syntactic parse of a
sentence. To illustrate, consider the following two possible
derivations of the sentence

They are flying planes

in some phrase-structure grammar for a subset of English (the two
trees are shown here by the grouping of their terminal constituents;
the intermediate node labels are illegible in this copy):

1. TOP:  They / are flying / planes

2. TOP:  They / are / flying planes

However, we are far from certain as to what, if anything, is the analog
to diverse semantic interpretations in syntactically ambiguous
characters.
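
For characters, multiple generation can be illustrated on the simplest case: a row of radicals side by side, derived with the horizontal rules of DZ4 alone (COMP = h (W, E); W = COMP; WT; E = COMP; ET). The sketch below counts the derivations of such a row; the function name and the representation of the WT and ET columns as Python sets are assumptions made for this example.

```python
from functools import lru_cache

def count_horizontal_parses(row, wt, et):
    """Count the derivations of `row` (radicals read left to right) as a
    COMP using only h (W, E), where W is a COMP or a member of `wt`
    and E is a COMP or a member of `et`."""
    row = tuple(row)

    @lru_cache(maxsize=None)
    def comp(i, j):
        # Ways to derive row[i:j] as COMP: split it into a west part and an east part.
        total = 0
        for k in range(i + 1, j):
            total += side(i, k, wt) * side(k, j, et)
        return total

    def side(i, j, column):
        # A single radical must belong to the column; a longer span must be a COMP.
        if j - i == 1:
            return 1 if row[i] in column else 0
        return comp(i, j)

    return comp(0, len(row))
```

A row a b c in which the middle radical b is listed under both WT and ET receives two derivations, h (a, h (b, c)) and h (h (a, b), c); if b is listed only under ET, the count drops to one. Ambiguity of this kind therefore depends on how radicals are distributed over the terminal columns.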

At this point we also call the reader's attention to two features of
our grammar which we consider to be defects, and indicate how they
will be overcome in a future version. These features are

(a) The apparently arbitrary assignment of certain radicals to
certain geometric positions, for example, the assignment
of [illegible] to the north, and

(b) The listing as single terminals of characters whose
immediate constituent structure cannot be expressed in terms
of the grammar, for example [illegible].

Consider the radical [illegible], for example. Within the limits of the
types of geometric structures handled by our grammar, there are three
possible assignments of [illegible] to a terminal class, namely, north, west,
and outside. We (somewhat, but not entirely, arbitrarily) assign [illegible]
to the north, [illegible] to the south, and, in general, when a radical
includes a vertical and a horizontal side we assign it to the horizontal.
Besides having the advantage of making a definite decision, this is
probably a more natural choice than either of the others, since it
seems that the class of radicals which can occur with, say, [illegible], has
more in common with those that can occur with radicals which beyond
doubt belong to the north terminal class, such as [illegible], than it has with,
say, those which can occur inside [illegible]. For similar reasons, [illegible]
is treated as an outside rather than as two separate radicals, one west
terminal and one east terminal.

A still better solution, which is in the process of being
incorporated into the grammar, consists of an increase in the number of
basic operations and a refinement of the co-occurrence classes. In
the next version of the grammar we will consider [illegible], [illegible],
[illegible], and [illegible] to be generated by four different binary
operations, each having its own distinct set of co-occurrence restrictions.

The fact that we do not partition [illegible] into those constituents,
namely, [illegible] and [illegible], of which it is traditionally held to be
composed, is explained by the fact that the grammar does not have a
superimposition operation. However, this is not so much a justification of our
choice as an admission that the grammar is incomplete. An
examination of the 300 characters in Wang [8] indicates that up to 10% of these
would be more naturally treated by means of a superimposition
operation than by means of the non-overlapping operations exclusively.
To incorporate this operation will require detailed study of
"superimposed" characters with a view to defining in detail new
co-occurrence classes.

In addition to these, we wish to mention two further questions
which arise in the consideration of grammars for Chinese characters.
These questions are of a more general nature than those just
considered in that they are not tied to any particular grammar or even to
any particular model of language.

The first of these is the question of well-formedness in general.
That is, what is the class of characters which we want our grammar
to generate? We clearly would not be satisfied with a grammar which
generated all the characters in Mathews [6] and no more, for it would
have no predictive power and thus would not be able to account for
those characters which Mathews omits, those coined since its
publication, and those which will be coined in the future.

This is a perplexing problem but one which has been encountered
previously by linguists working with (spoken) natural languages. We
seem to be driven to the classic field technique of informant response.
We are, in other words, forced to rely on the intuition of a native
writer of Chinese as to what does and what does not constitute a
well-formed character. Thus, on the analogy of [illegible], [illegible], and
[illegible], we construct [illegible], which does not occur in Mathews, and
offer it to our informant for his evaluation. The informant accepts it (in this
case) as being well-formed, i.e., a reasonable combination of radicals,
and therefore the power to generate [illegible] is added to our minimal
requirements for an acceptable grammar.

The second problem is that of partitioning the set of radicals into
classes in such a way that all members of a given class play a similar
role in the grammar. In other words, it is the problem of finding an
operational definition of radical. We decide whether a group of strokes
is one radical or should be cut into two or more subgroups by the
following criterion. We accept a proposed division of a group of
strokes into two or more subgroups if and only if for each of the
smaller groups resulting from the proposed cut there are many
combinations of strokes which can replace it without requiring any change
in the remainder of the original group.

For example, we consider [illegible] to be composed of two radicals
([illegible] and [illegible]) rather than three, since our intuition rebels at the
thought of changing one part of [illegible] without changing the other. Also
we would not divide [illegible] into [illegible] and [illegible] since, although
there are many combinations which can occur over [illegible], there are few
if any other than [illegible] which can occur under [illegible].
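
The substitution criterion can be sketched as a test against a corpus of attested combinations. The sketch below simplifies the criterion to cuts into exactly two parts, represents the corpus as a set of (first part, second part) pairs, and interprets "many" as a numeric `threshold`; all three choices, like the name `accept_cut`, are assumptions made for illustration and are not part of the original procedure.

```python
def accept_cut(parts, corpus, threshold=3):
    """Accept a proposed two-way cut (a, b) only if each part can be
    replaced by at least `threshold` other stroke groups while the other
    part stays unchanged, as attested in `corpus`."""
    a, b = parts
    # Stroke groups that occur with b in place of a.
    alternatives_for_a = {x for (x, y) in corpus if y == b and x != a}
    # Stroke groups that occur with a in place of b.
    alternatives_for_b = {y for (x, y) in corpus if x == a and y != b}
    return (len(alternatives_for_a) >= threshold
            and len(alternatives_for_b) >= threshold)

if __name__ == "__main__":
    # Hypothetical corpus: "p" and "q" each combine freely with other groups.
    free = {("p", "q"), ("p", "r"), ("p", "s"), ("p", "t"),
            ("u", "q"), ("v", "q"), ("w", "q")}
    # Hypothetical corpus: many groups occur over "q", but nothing else under "p".
    bound = {("p", "q"), ("u", "q"), ("v", "q"), ("w", "q")}
    print(accept_cut(("p", "q"), free))    # True
    print(accept_cut(("p", "q"), bound))   # False
```

The second corpus mirrors the last example above: replacements are plentiful on one side of the cut but absent on the other, so the group is kept as a single radical.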

However,    it  must  be  admitted  that  not  all  of  the  terminals  in  the 
grammar  in  Appendix  B  were  arrived  at  by  this  operational  definition 
of  radical.      The  definition  problem  remains  perplexing. 

D.       Radicals  are  Represented  by  Radical  Variants 


It has been a part of the traditional view of radicals to list several
variant forms for some radicals. Thus, for example, 人 and 亻
are said to be variant forms of radical "9" 1/ and 辵 and 辶 are said
to be variant forms of radical "162".

We accept the view that some radicals have variant forms.
However, we depart from tradition in that we list many more instances of
multivariant radicals than occur in the traditional list; looked at
carefully, the data yield quite a number of instances of radical
variation.

1/ Numbers in quotes refer to radicals of the traditional list. See
Appendix A.

A relatively trivial phenomenon is size variation. When a radical
such as "86" stands in isolation as a character, as in 火, it has, let
us say, normal size; when it is embedded in a simple character, as in
[illegible], it is smaller. Finally, when it is embedded in a more complex
character, as in [illegible], it is much smaller. One of the problems in this
research is to determine how many size-variants there are per radical.

Related to size variation is shape variation. Sometimes
when a radical is embedded in a character, not only is it reduced in
size, but its shape is also distorted. For example, note the shape
variation of radical "75" in [illegible] and [illegible].

Of course, [illegible] and [illegible] are still intuitively recognizable as being
instances of the same radical. There are other cases which are not
so clear-cut. For example, 亻 and 人 are traditionally considered
variants of radical "9", but one feels that this is at least an extreme
case of radical variation. This is directly analogous to the phenomenon
of suppletion in natural language. An example of suppletive forms in
English is "good" and the "bett-" and "be-" elements in "better" and
"best." These are similar in meaning and noncontrastive in
distribution. However, they are phonemically completely dissimilar.

One begins to search for constraints on radical variation. A
perhaps sufficient constraint might be that two candidates for being radical
variants of the same radical must be pictorially similar. Pictorial
similarity 1/ remains to be defined precisely, but an idea of our
intuition about it can be seen from our tentative solution of the 人 ~ 亻
problem. Both 人 and 亻 are grouped into the same radical,
because they are both sufficiently described by some abstract
structural statement like: Radical "9" consists of a northeast-to-
southwest stroke which is joined L-wise to a non-west-to-east stroke.
(See the next section for a discussion of L-adjoining.)

1/ The linguist will, of course, see in this the analog of phonetic
similarity as a constraint on the grouping of complementarily
distributed phones into one phoneme.
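
The structural statement for radical "9" can be read as a predicate over stroke descriptions. In the sketch below, a shape is reduced to two stroke directions and the operation joining them; the direction labels ("NE-SW", "N-S", and so on) and the class and function names are hypothetical encodings chosen for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TwoStrokeShape:
    first_direction: str     # e.g. "NE-SW"
    second_direction: str    # e.g. "N-S", "NW-SE", "W-E"
    operation: str           # "T", "L", "X", or "near"

def matches_radical_9(shape: TwoStrokeShape) -> bool:
    """A northeast-to-southwest stroke joined L-wise to a stroke that is
    anything other than west-to-east."""
    return (shape.first_direction == "NE-SW"
            and shape.operation == "L"
            and shape.second_direction != "W-E")

if __name__ == "__main__":
    # Two hypothetical variants that differ only in their second stroke.
    variant_a = TwoStrokeShape("NE-SW", "NW-SE", "L")
    variant_b = TwoStrokeShape("NE-SW", "N-S", "L")
    print(matches_radical_9(variant_a), matches_radical_9(variant_b))
```

Because the statement constrains the second stroke only negatively, both variants satisfy it even though they differ pictorially; that underspecification is what lets one abstract description cover both.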

A related phenomenon is that of stroke-sharing. Often, when two
radicals combine into a character, they partially overlap to the extent
that one stroke of each radical occupies the same space. A clear case
of stroke-sharing is in the character subpart [illegible], where [illegible]
and [illegible] share a stroke. At this point we mention that the rules to
cover cases such as this, i.e., the rules corresponding to the morphophonemic
rules in a grammar for a spoken language, are tacitly assumed to be
applied in our grammar whenever relevant. 1/

E.      Radical  Variants  are  Composed  of  Strokes 

It is clear from even a superficial study of the internal structure
of radicals and their variants that there are regularities to be observed
in the phenomenon of stroke-adjoinment. For example, [illegible], [illegible],
and [illegible] may combine to produce [illegible], but they may not combine to
produce [illegible] and [illegible], for these latter two are not well-formed.

1/ Stroke-sharing is of course closely related to the phenomenon
of superimposition mentioned above.

There seem to be four common adjoining operations. First, there
is T-adjoinment, as in [illegible], where [illegible] and [illegible] combine
T-wise. Then there is L-adjoining: [illegible] and [illegible] combine L-wise
to produce [illegible]. There is also X-adjoining, where two strokes intersect
to form a radical or radical subpart. An example is the combination of
[illegible] and [illegible] to form [illegible]. Finally, there is an operation
which does not actually adjoin two strokes, but which rather places them in a
"near" relation; [illegible] and [illegible] are adjoined by the near-operation
in [illegible].

The task here is to devise rules powerful enough to state the
internal structure of all the radicals in terms of their constituent
strokes and the stroke operations. We have in fact constructed a very
preliminary finite-state grammar to represent these stroke-adjoining
operations. This grammar, though perhaps promising, does fail on
two counts. First, it cannot account for many radicals which seem
to have nearly unique structures and, second, it seems to be able to
account for many multiradical structures. 1/ Thus, it seems to be at
present both too weak and too strong.

1/ Such as [illegible], a traditional radical, but decomposable into
[illegible], [illegible], and [illegible].

F. Strokes Are Represented By Stroke-Variants

The model does define this area of research. However, we are
not sure that this is a particularly fruitful area. There seem, in fact,
to be very few instances of strokes with variant shapes. One possibility
is a stroke which has the two variants [illegible] and [illegible]. It might be
characterized as a "vertical hooked stroke", the direction of the
hook being predictable from the environment in which it occurs. Of
course, every stroke has size variants, but this is almost a trivial
observation.

G. Stroke Variants are Composed of Distinctive Components

It is presumably the case that each stroke-variant can be uniquely
specified by a particular combination of distinctive components. Such
components and their values would perhaps include:

Ending: Hooked, Nonhooked
   (This would distinguish [illegible] from [illegible].)
Direction: North-South, East-West, etc.
   ([illegible] versus [illegible].)
Size: Large, Small
   ([illegible] versus [illegible] in [illegible].)
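
The idea that each stroke-variant is a bundle of component values can be sketched as a small data structure, together with a check that no two variants in an inventory share the same bundle. The component names and values follow the list above; the class names, the function `uniquely_specified`, and the sample inventory are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict

class Ending(Enum):
    HOOKED = "hooked"
    NONHOOKED = "nonhooked"

class Direction(Enum):
    NORTH_SOUTH = "north-south"
    EAST_WEST = "east-west"

class Size(Enum):
    LARGE = "large"
    SMALL = "small"

@dataclass(frozen=True)
class StrokeVariant:
    ending: Ending
    direction: Direction
    size: Size

def uniquely_specified(inventory: Dict[str, StrokeVariant]) -> bool:
    """True if no two named stroke-variants share the same combination
    of component values."""
    return len(set(inventory.values())) == len(inventory)

if __name__ == "__main__":
    # Hypothetical inventory with made-up variant names.
    inventory = {
        "vertical": StrokeVariant(Ending.NONHOOKED, Direction.NORTH_SOUTH, Size.LARGE),
        "vertical-hooked": StrokeVariant(Ending.HOOKED, Direction.NORTH_SOUTH, Size.LARGE),
        "short-horizontal": StrokeVariant(Ending.NONHOOKED, Direction.EAST_WEST, Size.SMALL),
    }
    print(uniquely_specified(inventory))
```

If two variants ever collapse onto the same bundle, the check fails, which would signal that a further distinctive component is needed.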

H.       Concluding  Remark 

It is expected that future work will provide much more in the way
of results in the last four areas. As future work is pursued, much
time will be devoted to work with native informants. The final step
will be to express the results of this analysis in the form of a
completed formal grammar.

APPENDIX A
LIST OF TRADITIONAL 214 RADICALS, WITH VARIANT FORMS SHOWN

[The table of the 214 numbered radicals and their variant forms is
illegible in this copy.]

APPENDIX B

Grammars of this type are fully explained in Section C.
The numbers which appear in the left-most column of the lexical rules
refer to the number of strokes per radical in the list of radicals below
the number.

Grammar of radical-combination

Syntactic Rules

CHAR = COMP; IT
COMP = v (N, S); h (W, E); s (O, CHAR)
N    = COMP; NT
S    = COMP; ST
W    = COMP; WT
E    = COMP; ET
O    = OT

Lexical Rules

WT   ET   NT   ST   OT   IT

[The radicals listed under each column heading are illegible in this copy.]